Bias direction


Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models

Liang, Haoyu, Sun, Youran, Cai, Yunfeng, Zhu, Jun, Zhang, Bo

arXiv.org Artificial Intelligence

The security of large language models (LLMs) has gained significant attention recently, and various defense mechanisms have been developed to prevent harmful outputs, among which safeguards based on text embedding models serve as a fundamental defense. Through testing, we discover that the output distribution of text embedding models is significantly biased with a large mean. Inspired by this observation, we propose novel, efficient methods to search for universal magic words that can attack text embedding models. Appended as suffixes, these universal magic words move the embedding of any text towards the bias direction and can therefore manipulate the similarity of any text pair and mislead safeguards. By appending magic words to user prompts and requiring LLMs to end their answers with magic words, attackers can jailbreak the safeguard. To eliminate this security risk, we also propose defense mechanisms against such attacks, which correct the biased distribution of text embeddings in a training-free manner.
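
A minimal sketch of the underlying observation, assuming the sentence-transformers library and an off-the-shelf embedding model (the model name and candidate suffixes below are illustrative, not the paper's search procedure): the mean of the embedding distribution is taken as the bias direction, and candidate suffixes are scored by how far they pull a text's embedding toward it, which is what inflates pairwise similarity.

```python
# Sketch: estimate the mean ("bias") direction of an embedding model, then check
# how strongly a candidate suffix pulls arbitrary text toward that direction.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

corpus = ["The weather is nice today.", "How do I bake bread?", "Stock prices fell sharply."]
emb = model.encode(corpus, normalize_embeddings=True)   # shape (n, d)
bias_dir = emb.mean(axis=0)
bias_dir /= np.linalg.norm(bias_dir)                    # unit mean direction

def shift_toward_bias(text: str, suffix: str) -> float:
    """How much appending `suffix` moves the embedding toward the bias direction."""
    e_plain = model.encode([text], normalize_embeddings=True)[0]
    e_suff = model.encode([text + " " + suffix], normalize_embeddings=True)[0]
    return float(e_suff @ bias_dir - e_plain @ bias_dir)

# Rank hypothetical candidate suffixes; the paper searches for universal ones.
candidates = ["lorem", "zzzq", "etc"]
print(sorted(candidates, key=lambda s: -shift_toward_bias("any benign prompt", s)))
```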


Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models

Tong, Schrasing, Zemour, Eliott, Lohanimit, Rawisara, Kagal, Lalana

arXiv.org Artificial Intelligence

Although large language models (LLMs) have demonstrated their effectiveness in a wide range of applications, they have also been observed to perpetuate unwanted biases present in the training data, potentially causing harm to marginalized communities. In this paper, we mitigate bias by leveraging small biased and anti-biased expert models to obtain a debiasing signal that is added to the LLM output at decoding time. This approach combines resource efficiency with interpretability and can be optimized to mitigate specific types of bias, depending on the target use case. Experiments on mitigating gender, race, and religion biases show a reduction in bias on several local and global bias metrics while preserving language model performance.
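
A minimal sketch of the decoding-time combination, under the assumption that the debiasing signal is the logit difference between the anti-biased and biased experts scaled by a coefficient alpha (the model names and the exact combination rule are illustrative stand-ins, not the paper's configuration):

```python
# Sketch: add a debiasing signal from small expert models to the base LLM's
# next-token logits at decoding time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")              # base LLM (stand-in)
# In practice these would be small models fine-tuned on biased / anti-biased data;
# the same checkpoint is used here only as a placeholder with a shared vocabulary.
biased = AutoModelForCausalLM.from_pretrained("distilgpt2")
antibiased = AutoModelForCausalLM.from_pretrained("distilgpt2")

def debiased_next_token(prompt: str, alpha: float = 1.0) -> int:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        l_base = base(ids).logits[:, -1, :]
        l_bias = biased(ids).logits[:, -1, :]
        l_anti = antibiased(ids).logits[:, -1, :]
    # Debiasing signal added on top of the base logits at decoding time.
    l_final = l_base + alpha * (l_anti - l_bias)
    return int(l_final.argmax(dim=-1))

print(tok.decode(debiased_next_token("The nurse said that")))
```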


Evaluating Metrics for Bias in Word Embeddings

Schröder, Sarah, Schulz, Alexander, Kenneweg, Philip, Feldhans, Robert, Hinder, Fabian, Hammer, Barbara

arXiv.org Artificial Intelligence

Over the last years, word and sentence embeddings have established themselves as standard text preprocessing for all kinds of NLP tasks and have improved performance significantly. Unfortunately, it has also been shown that these embeddings inherit various kinds of biases from the training data and thereby pass on biases present in society to NLP solutions. Many papers have attempted to quantify bias in word or sentence embeddings to evaluate debiasing methods or compare different embedding models, usually with cosine-based metrics. However, some works have lately raised doubts about these metrics, showing that even though such metrics report low biases, other tests still reveal biases. In fact, there is a great variety of bias metrics or tests proposed in the literature without any consensus on the optimal solution. Yet there are few works that evaluate bias metrics on a theoretical level or elaborate the advantages and disadvantages of different bias metrics. In this work, we explore different cosine-based bias metrics. We formalize a bias definition based on ideas from previous works and derive conditions for bias metrics. Furthermore, we thoroughly investigate the existing cosine-based metrics and their limitations to show why these metrics can fail to report biases in some cases. Finally, we propose a new metric, SAME, to address the shortcomings of existing metrics and mathematically prove that SAME behaves appropriately.
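
As background for the metrics analyzed here, the per-word association used by classic cosine-based scores such as WEAT can be written down in a few lines (this is the standard association with toy vectors, not the proposed SAME metric):

```python
# Sketch: a target word's bias is the difference of its mean cosine similarity
# to two attribute word sets.
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, attrs_a, attrs_b):
    """s(w, A, B): positive if w is closer to attribute set A than to B."""
    return np.mean([cos(w, a) for a in attrs_a]) - np.mean([cos(w, b) for b in attrs_b])

rng = np.random.default_rng(0)
d = 50
he, she = rng.normal(size=d), rng.normal(size=d)      # toy attribute vectors
career = rng.normal(size=d) + 0.5 * he                # toy target leaning toward "he"
print(association(career, [he], [she]))               # > 0 indicates a "male" association
```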


Discovering Bias in Latent Space: An Unsupervised Debiasing Approach

Adila, Dyah, Zhang, Shuai, Han, Boran, Wang, Yuyang

arXiv.org Artificial Intelligence

The question-answering (QA) capabilities of foundation models are highly sensitive to prompt variations, rendering their performance susceptible to superficial, non-meaning-altering changes. This vulnerability often stems from the model's preference or bias towards specific input characteristics, such as option position or superficial image features in multi-modal settings. We propose to rectify this bias directly in the model's internal representation. Our approach, SteerFair, finds the bias direction in the model's representation space and steers activation values away from it during inference. Specifically, we exploit the observation that bias often adheres to simple association rules, such as the spurious association between the first option and correctness likelihood. Next, we construct demonstrations of these rules from unlabeled samples and use them to identify the bias directions. We empirically show that SteerFair significantly reduces instruction-tuned model performance variance across prompt modifications on three benchmark tasks. Remarkably, our approach surpasses a supervised baseline trained with 100 labels by an average of 10.86% accuracy points and 12.95 score points, and matches its performance when trained with 500 labels.
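
A minimal sketch of the steering step, assuming the bias direction is taken as the normalized difference of mean hidden states between two rule-based demonstration groups (the full SteerFair pipeline is more involved; this only shows the core idea of removing the projection onto an estimated bias direction at inference time):

```python
# Sketch: estimate a bias direction from rule-based demonstration groups and
# remove its projection from activations during inference.
import numpy as np

def bias_direction(h_group_a: np.ndarray, h_group_b: np.ndarray) -> np.ndarray:
    """h_group_*: (n, d) hidden states for demonstrations following each association rule."""
    d = h_group_a.mean(axis=0) - h_group_b.mean(axis=0)
    return d / np.linalg.norm(d)

def steer_away(h: np.ndarray, direction: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Subtract the component of activation h along the bias direction."""
    return h - strength * (h @ direction) * direction

rng = np.random.default_rng(0)
h_first_option = rng.normal(size=(100, 64)) + 0.3     # e.g. "answer is option A" demos
h_other_option = rng.normal(size=(100, 64))
v = bias_direction(h_first_option, h_other_option)
h_test = rng.normal(size=64)
print(np.dot(steer_away(h_test, v), v))               # close to 0: bias component removed
```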


Semantic Properties of cosine based bias scores for word embeddings

Schröder, Sarah, Schulz, Alexander, Hinder, Fabian, Hammer, Barbara

arXiv.org Artificial Intelligence

In the domain of Natural Language Processing (NLP), many works have investigated social biases in terms of associations in the embedding space. Early works [1, 2] introduced methods to measure and mitigate social biases based on cosine similarity in word embeddings. With NLP research progressing to large language models and contextualized embeddings, doubts have been raised about whether these methods are still suitable for fairness evaluation [3], and other works criticize that, for instance, the Word Embedding Association Test (WEAT) [2] fails to detect some kinds of biases [4, 5]. Overall, there exists a great variety of bias measures in the literature, which do not necessarily detect the same biases [6, 4, 5]. In general, researchers are questioning the usability of model-intrinsic bias measures, such as cosine-based methods [7, 8, 9]. There exist few papers that compare the performance of different bias scores [10, 11] and works that evaluate experimental setups for bias measurement [12]. However, to our knowledge, only two works investigate the properties of intrinsic bias scores on a theoretical level [5, 13]. To further close this gap, we evaluate the semantic properties of cosine-based bias scores, focusing on bias quantification as opposed to bias detection. We make the following contributions: (i) We formalize the properties of trustworthiness and comparability as requirements for cosine-based bias scores.
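
For reference, the WEAT score [2] mentioned above compares two target word sets X and Y against two attribute sets A and B via cosine associations; scores of this form are among those analyzed (notation follows the original WEAT formulation):

```latex
s(w, A, B) = \operatorname{mean}_{a \in A} \cos(\vec{w}, \vec{a})
           - \operatorname{mean}_{b \in B} \cos(\vec{w}, \vec{b}),
\qquad
s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B).
```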


The SAME score: Improved cosine based bias score for word embeddings

Schröder, Sarah, Schulz, Alexander, Kenneweg, Philip, Feldhans, Robert, Hinder, Fabian, Hammer, Barbara

arXiv.org Artificial Intelligence

Over the last years, word and sentence embeddings have established themselves as standard text preprocessing for all kinds of NLP tasks and have improved performance in these tasks significantly. Unfortunately, it has also been shown that these embeddings inherit various kinds of biases from the training data and thereby pass on biases present in society to NLP solutions. Many papers have attempted to quantify bias in word or sentence embeddings to evaluate debiasing methods or compare different embedding models, often with cosine-based scores. However, some works have raised doubts about these scores, showing that even though they report low biases, biases persist and can be revealed by other tests. In fact, there is a great variety of bias scores or tests proposed in the literature without any consensus on the optimal solution. There is a lack of work that studies the behavior of bias scores and elaborates their advantages and disadvantages. In this work, we explore different cosine-based bias scores. We provide a bias definition based on ideas from the literature and derive novel requirements for bias scores. Furthermore, we thoroughly investigate the existing cosine-based scores and their limitations in order to show why these scores fail to report biases in some situations. Finally, we propose a new bias score, SAME, to address the shortcomings of existing bias scores and show empirically that SAME is better suited to quantify biases in word embeddings.
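
One toy illustration of why a cosine-based score can fail to report bias (an assumed failure mode shown with synthetic vectors, not the paper's formal argument): when target words lean toward different attributes, signed associations can cancel in the aggregate, whereas an absolute-value aggregation still flags the bias.

```python
# Sketch: a signed, summed cosine score cancels out when half the target words
# lean toward one attribute and half toward the other.
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
d = 50
he, she = rng.normal(size=d), rng.normal(size=d)
targets = [rng.normal(size=d) + 0.5 * he for _ in range(5)] \
        + [rng.normal(size=d) + 0.5 * she for _ in range(5)]

assoc = [cos(w, he) - cos(w, she) for w in targets]
print(f"signed mean:   {np.mean(assoc):+.3f}")         # close to 0: opposite biases cancel
print(f"absolute mean: {np.mean(np.abs(assoc)):.3f}")  # clearly > 0: bias is still visible
```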


Robustness and Reliability of Gender Bias Assessment in Word Embeddings: The Role of Base Pairs

Zhang, Haiyang, Sneyd, Alison, Stevenson, Mark

arXiv.org Artificial Intelligence

It has been shown that word embeddings can exhibit gender bias, and various methods have been proposed to quantify this. However, the extent to which these methods capture social stereotypes inherited from the data has been debated. Bias is a complex concept and there exist multiple ways to define it. Previous work has leveraged gender word pairs to measure bias and extract biased analogies. We show that the reliance on these gendered pairs has strong limitations: bias measures based on them are not robust and cannot identify common types of real-world bias, whilst analogies utilising them are unsuitable indicators of bias. In particular, the well-known analogy "man is to computer-programmer as woman is to homemaker" is due to word similarity rather than societal bias. This has important implications for work on measuring bias in embeddings and related work on debiasing embeddings.
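
A toy illustration of the paper's argument with synthetic vectors (not real embeddings): the analogy completion can return "homemaker" purely because it is the nearest neighbour of "woman", with no programmer/homemaker gender offset involved.

```python
# Sketch: analogy arithmetic returns the word closest to the query vector; if
# "homemaker" is simply very similar to "woman", it wins without any bias offset.
import numpy as np

rng = np.random.default_rng(2)
d = 50
man, woman, programmer = (rng.normal(size=d) for _ in range(3))
candidates = {
    "homemaker": woman + 0.1 * rng.normal(size=d),   # close to "woman" by construction
    "engineer":  rng.normal(size=d),
    "teacher":   rng.normal(size=d),
}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Standard analogy arithmetic; the query words themselves are excluded as candidates.
query = programmer - man + woman
best = max(candidates, key=lambda w: cos(candidates[w], query))
print("man : programmer :: woman :", best)                         # -> homemaker
print("cos(woman, homemaker) =", round(cos(woman, candidates["homemaker"]), 3))
```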